AITopics | language identification model

language identification model

Validating and Exploring Large Geographic Corpora

Dunn, Jonathan

arXiv.org Artificial Intelligence, Mar-12-2024

This paper investigates the impact of corpus creation decisions on large multi-lingual geographic web corpora. Beginning with a 427 billion word corpus derived from the Common Crawl, three methods are used to improve the quality of sub-corpora representing specific language-country pairs like New Zealand English: (i) the agreement of independent language identification systems, (ii) hash-based deduplication, and (iii) location-specific outlier detection. The impact of each of these steps is then evaluated at the language level and the country level by using corpus similarity measures to compare each resulting corpus with baseline data sets. The goal is to understand the impact of upstream data cleaning decisions on downstream corpora with a specific focus on under-represented languages and populations. The evaluation shows that the validity of sub-corpora is improved with each stage of cleaning but that this improvement is unevenly distributed across languages and populations. This result shows how standard corpus creation techniques can accidentally exclude under-represented populations.
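The first two cleaning steps named in the abstract, agreement between independent language identification systems and hash-based deduplication, can be illustrated with a minimal sketch. This is not the paper's pipeline: the normalization rule, the SHA-256 choice, and the two toy identifiers (`toy_identifier_a`, `toy_identifier_b`) are assumptions made for illustration, and location-specific outlier detection is not covered.

```python
import hashlib
from typing import Callable, Iterable, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return " ".join(text.lower().split())


def deduplicate(docs: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each document, keyed by a SHA-256 digest."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def filter_by_agreement(
    docs: Iterable[str],
    identifiers: List[Callable[[str], str]],
    target: str,
) -> List[str]:
    """Keep only documents that every identifier labels as `target`."""
    return [d for d in docs if all(ident(d) == target for ident in identifiers)]


# Hypothetical stand-ins for real language identification systems.
def toy_identifier_a(text: str) -> str:
    return "eng" if "the" in text.lower().split() else "und"


def toy_identifier_b(text: str) -> str:
    return "eng" if "and" in text.lower().split() else "und"


if __name__ == "__main__":
    corpus = [
        "The cat and the dog",
        "the  cat and THE dog",  # duplicate after normalization
        "El gato y el perro",
        "The end",               # the two identifiers disagree here
    ]
    unique = deduplicate(corpus)
    print(filter_by_agreement(unique, [toy_identifier_a, toy_identifier_b], "eng"))
```

Requiring agreement trades recall for precision: a document is dropped whenever any one identifier disagrees, which is one way such a filter could remove text from under-represented varieties.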

corpora, corpus, geographic corpora, (10 more...)

arXiv:2403.08198

Country:

Oceania > New Zealand (0.24)
South America > Colombia (0.05)
South America > Chile (0.04)
(73 more...)

Genre: Research Report > New Finding (0.48)

Industry:

Transportation > Passenger (0.46)
Transportation > Marine (0.46)
Consumer Products & Services > Travel (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Data Science > Data Mining (0.93)
Information Technology > Artificial Intelligence > Machine Learning (0.88)

A Survey on Data Selection for Language Models

Albalak, Alon, Elazar, Yanai, Xie, Sang Michael, Longpre, Shayne, Lambert, Nathan, Wang, Xinyi, Muennighoff, Niklas, Hou, Bairu, Pan, Liangming, Jeong, Haewon, Raffel, Colin, Chang, Shiyu, Hashimoto, Tatsunori, Wang, William Yang

arXiv.org Artificial Intelligence, Mar-8-2024

A major factor in the recent success of large language models is the use of enormous and ever-growing text datasets for unsupervised pre-training. However, naively training a model on all available data may not be optimal (or feasible), as the quality of available text data can vary. Filtering out data can also decrease the carbon footprint and financial costs of training models by reducing the amount of training required. Data selection methods aim to determine which candidate data points to include in the training dataset and how to appropriately sample from the selected data points. The promise of improved data selection methods has caused the volume of research in the area to rapidly expand. However, because deep learning is mostly driven by empirical evidence and experimentation on large-scale data is expensive, few organizations have the resources for extensive data selection research. Consequently, knowledge of effective data selection practices has become concentrated within a few organizations, many of which do not openly share their findings and methodologies. To narrow this gap in knowledge, we present a comprehensive review of existing literature on data selection methods and related research areas, providing a taxonomy of existing approaches. By describing the current landscape of research, this work aims to accelerate progress in data selection by establishing an entry point for new and established researchers. Additionally, throughout this review we draw attention to noticeable holes in the literature and conclude the paper by proposing promising avenues for future research.

alignment performance efficiency, function and selection mechanism, language identification model, (17 more...)

arXiv:2402.16827

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > California > San Francisco County > San Francisco (0.13)
North America > United States > New York > New York County > New York City (0.04)
(26 more...)

Genre:

Research Report (1.00)
Overview (1.00)

Industry:

Law (1.00)
Education (1.00)
Health & Medicine (0.67)
Leisure & Entertainment > Sports (0.45)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(3 more...)

Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research

Soldaini, Luca, Kinney, Rodney, Bhagia, Akshita, Schwenk, Dustin, Atkinson, David, Authur, Russell, Bogin, Ben, Chandu, Khyathi, Dumas, Jennifer, Elazar, Yanai, Hofmann, Valentin, Jha, Ananya Harsh, Kumar, Sachin, Lucy, Li, Lyu, Xinxi, Lambert, Nathan, Magnusson, Ian, Morrison, Jacob, Muennighoff, Niklas, Naik, Aakanksha, Nam, Crystal, Peters, Matthew E., Ravichander, Abhilasha, Richardson, Kyle, Shen, Zejiang, Strubell, Emma, Subramani, Nishant, Tafjord, Oyvind, Walsh, Pete, Zettlemoyer, Luke, Smith, Noah A., Hajishirzi, Hannaneh, Beltagy, Iz, Groeneveld, Dirk, Dodge, Jesse, Lo, Kyle

arXiv.org Artificial Intelligence, Jan-31-2024

Language models have become a critical technology for tackling a wide range of natural language processing tasks, yet many details about how the best-performing language models were developed are not reported. In particular, information about their pretraining corpora is seldom discussed: commercial language models rarely provide any information about their data; even open models rarely release the datasets they are trained on, or an exact recipe to reproduce them. As a result, it is challenging to conduct certain threads of language modeling research, such as understanding how training data impacts model capabilities and shapes their limitations. To facilitate open research on language model pretraining, we release Dolma, a three-trillion-token English corpus, built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials. In addition, we open-source our data curation toolkit to enable further experimentation and reproduction of our work. In this report, we document Dolma, including its design principles, details about its construction, and a summary of its contents. We interleave this report with analyses and experimental results from training language models on intermediate states of Dolma to share what we have learned about important data curation practices, including the role of content or quality filters, deduplication, and multi-source mixing. Dolma has been used to train OLMo, a state-of-the-art, open language model and framework designed to build and study the science of language modeling.

large language model, machine learning, natural language, (21 more...)

arXiv:2402.00159

Country:

North America > Canada (0.46)
Asia > Middle East > UAE (0.14)
Europe > Ireland (0.14)
(10 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Government (1.00)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

The first neural machine translation system for the Erzya language

Dale, David

arXiv.org Artificial Intelligence, Sep-19-2022

We present the first neural machine translation system for translation between the endangered Erzya language and Russian, together with the dataset we collected to train and evaluate it. The BLEU scores are 17 and 19 for translation to Erzya and Russian respectively, and more than half of the translations are rated as acceptable by native speakers. We also adapt our model to translate between Erzya and 10 other languages, but without additional parallel data, the quality in these directions remains low. We release the translation models along with the collected text corpus, a new language identification model, and a multilingual sentence encoder adapted for the Erzya language. These resources will be available at https://github.com/slone-nlp/myv-nmt.

artificial intelligence, natural language, translation, (16 more...)

arXiv:2209.09368

Country:

Asia > Russia (0.14)
Europe > Ireland > Leinster > County Dublin > Dublin (0.05)
Europe > Russia > Volga Federal District > Republic of Mordovia > Saransk (0.04)
(10 more...)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

What's in a Name?

Konstantopoulos, Stasinos

arXiv.org Artificial Intelligence, Oct-8-2007

This paper describes experiments on identifying the language of a single name in isolation or in a document written in a different language. A new corpus has been compiled and made available, matching names against languages. This corpus is used in a series of experiments measuring the performance of general language models and names-only language models on the language identification task. Conclusions are drawn from the comparison between using general language models and names-only language models and between identifying the language of isolated names and the language of very short document fragments. Future research directions are outlined.
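As a rough illustration of identifying the language of an isolated name with a names-only language model, the sketch below trains one character trigram model per language and picks the language whose model gives the name the highest log-probability. The abstract does not specify the model family, so the trigram order, the add-one smoothing, the `alphabet_size` constant, and the tiny toy name lists are all assumptions made for illustration, not the paper's setup or data.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List


def char_ngrams(text: str, n: int = 3) -> List[str]:
    """Character n-grams of a lowercased string with start/end padding."""
    padded = "^" * (n - 1) + text.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramModel:
    """Character n-gram model with add-one smoothing.

    `alphabet_size` is an assumed upper bound on the number of distinct
    characters; it is shared by all models so their scores are comparable.
    """

    def __init__(self, samples: Iterable[str], n: int = 3, alphabet_size: int = 64):
        self.n = n
        self.alphabet_size = alphabet_size
        self.ngrams: Counter = Counter()
        self.contexts: Counter = Counter()
        for sample in samples:
            for gram in char_ngrams(sample, n):
                self.ngrams[gram] += 1
                self.contexts[gram[:-1]] += 1

    def log_prob(self, text: str) -> float:
        """Sum of log P(char | previous n-1 chars) over the padded string."""
        total = 0.0
        for gram in char_ngrams(text, self.n):
            numerator = self.ngrams[gram] + 1
            denominator = self.contexts[gram[:-1]] + self.alphabet_size
            total += math.log(numerator / denominator)
        return total


def identify(name: str, models: Dict[str, CharNgramModel]) -> str:
    """Return the label of the model that scores `name` highest."""
    return max(models, key=lambda lang: models[lang].log_prob(name))


if __name__ == "__main__":
    # Toy, hand-picked name lists; a real experiment would use a proper corpus.
    models = {
        "Greek": CharNgramModel(["Papadopoulos", "Konstantinou", "Georgiou", "Nikolaidis"]),
        "English": CharNgramModel(["Smith", "Johnson", "Williams", "Brown"]),
    }
    for name in ["Papadakis", "Williamson"]:
        print(name, "->", identify(name, models))
```

A general language model in this sense would be trained on running text rather than on name lists; swapping the training samples is the only change this sketch would need to compare the two.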

artificial intelligence, natural language, text processing, (20 more...)

arXiv:0710.1481

Country:

Europe (1.00)
North America > United States (0.28)
North America > Canada (0.28)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.49)
